Overview

Dataset statistics

Number of variables: 15
Number of observations: 12034
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 1.4 MiB
Average record size in memory: 120.0 B
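
The two memory figures can be cross-checked with a little arithmetic. The sketch below assumes every one of the 15 columns is stored as an 8-byte value (for example int64 or float64) and ignores index overhead; neither assumption is stated in the report itself.

```python
# Figures copied from the dataset statistics above.
n_variables = 15
n_observations = 12034

# Assumption (not stated in the report): each value occupies 8 bytes.
BYTES_PER_VALUE = 8

record_size = n_variables * BYTES_PER_VALUE   # bytes per row
total_bytes = record_size * n_observations    # bytes for the whole table
total_mib = total_bytes / 2**20               # 1 MiB = 2**20 bytes

print(f"record size: {record_size} B")
print(f"total size:  {total_mib:.1f} MiB")
```

Under these assumptions the result agrees with the reported 120.0 B per record and 1.4 MiB in total.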

Variable types

NUM: 9
BOOL: 6

Reproduction

Analysis started: 2020-02-16 03:25:25.350675
Analysis finished: 2020-02-16 03:25:55.109179
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

High Correlation: company_id is highly correlated with receipt_id and 2 other fields
High Correlation: receipt_id is highly correlated with company_id and 2 other fields
High Correlation: matched_transaction_id is highly correlated with receipt_id and 2 other fields
High Correlation: feature_transaction_id is highly correlated with receipt_id and 2 other fields
High Correlation: TimeMappingMatch is highly correlated with DifferentPredictedTime
High Correlation: DifferentPredictedTime is highly correlated with TimeMappingMatch
High Correlation: DifferentPredictedDate is highly correlated with DateMappingMatch
High Correlation: DateMappingMatch is highly correlated with DifferentPredictedDate
Skewed: PredictedAmountMatch is highly skewed (γ1 = 23.44637336)
Zeros: DateMappingMatch has 9068 (75.4%) zeros
Zeros: AmountMappingMatch has 11225 (93.3%) zeros
Zeros: DescriptionMatch has 11581 (96.2%) zeros
Zeros: PredictedNameMatch has 11589 (96.3%) zeros
Zeros: PredictedAmountMatch has 11989 (99.6%) zeros
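
The percentages in the zeros warnings follow directly from the zero counts and the number of observations. A minimal sketch of that calculation, with the counts copied from the warnings above (the helper name `zero_share` is only illustrative):

```python
n_observations = 12034

# Zero counts copied from the warnings above.
zero_counts = {
    "DateMappingMatch": 9068,
    "AmountMappingMatch": 11225,
    "DescriptionMatch": 11581,
    "PredictedNameMatch": 11589,
    "PredictedAmountMatch": 11989,
}


def zero_share(zeros, total):
    """Return the percentage of observations that are exactly zero."""
    return 100.0 * zeros / total


shares = {
    name: round(zero_share(count, n_observations), 1)
    for name, count in zero_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {share}% zeros")
```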

Variables

receipt_id
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 1155
Unique (%): 9.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 27396.26417
Minimum: 10000
Maximum: 50226
Zeros: 0
Zeros (%): 0.0%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 10000
5-th percentile: 10057
Q1: 20057
median: 30105
Q3: 30343
95-th percentile: 50129
Maximum: 50226
Range: 40226
Interquartile range (IQR): 10286

Descriptive statistics

Standard deviation: 12037.59652
Coefficient of variation (CV): 0.4393882482
Kurtosis: -0.5771494461
Mean: 27396.26417
Median Absolute Deviation (MAD): 9630.465591
Skewness: 0.2542750523
Sum: 329686643
Variance: 144903730
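
Several of these figures are derived from the others, which gives a quick consistency check. The sketch below recomputes range, IQR, coefficient of variation, and variance from the receipt_id values listed above; it uses the standard textbook definitions, which are assumed (not confirmed by the report) to be the ones the profiler applies.

```python
# Values copied from the receipt_id statistics above.
minimum, maximum = 10000, 50226
q1, q3 = 20057, 30343
mean = 27396.26417
std = 12037.59652

value_range = maximum - minimum   # Range = Maximum - Minimum
iqr = q3 - q1                     # IQR = Q3 - Q1
cv = std / mean                   # CV = standard deviation / mean
variance = std ** 2               # Variance = standard deviation squared

print(f"range={value_range}, IQR={iqr}, CV={cv:.10f}, variance={variance:.0f}")
```

The recomputed values agree with the reported ones up to the rounding of the inputs.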
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[10000. 10000.5 10003.5 10013.5 10063. ... 50212.5 50215.5 50218.5 50222.5 50226. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
30393 | 25 | 0.2%
30081 | 25 | 0.2%
30303 | 25 | 0.2%
30177 | 23 | 0.2%
30053 | 23 | 0.2%
30203 | 23 | 0.2%
20032 | 22 | 0.2%
30235 | 22 | 0.2%
10165 | 22 | 0.2%
30139 | 22 | 0.2%
Other values (1145) | 11802 | 98.1%

Value | Count | Frequency (%)
10000 | 20 | 0.2%
10001 | 7 | 0.1%
10002 | 7 | 0.1%
10003 | 8 | 0.1%
10004 | 20 | 0.2%

Value | Count | Frequency (%)
50226 | 7 | 0.1%
50225 | 20 | 0.2%
50224 | 12 | 0.1%
50223 | 14 | 0.1%
50222 | 2 | < 0.1%

company_id
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 27247.79791
Minimum: 10000
Maximum: 50000
Zeros: 0
Zeros (%): 0.0%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 10000
5-th percentile: 10000
Q1: 20000
median: 30000
Q3: 30000
95-th percentile: 50000
Maximum: 50000
Range: 40000
Interquartile range (IQR): 10000

Descriptive statistics

Standard deviation: 12024.54299
Coefficient of variation (CV): 0.4413032946
Kurtosis: -0.5670587709
Mean: 27247.79791
Median Absolute Deviation (MAD): 9599.982201
Skewness: 0.2627528486
Sum: 327900000
Variance: 144589634.1
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[10000. 25000. 35000. 45000. 50000.], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
30000 | 4926 | 40.9%
20000 | 2375 | 19.7%
10000 | 2351 | 19.5%
50000 | 1383 | 11.5%
40000 | 999 | 8.3%

Value | Count | Frequency (%)
10000 | 2351 | 19.5%
20000 | 2375 | 19.7%
30000 | 4926 | 40.9%
40000 | 999 | 8.3%
50000 | 1383 | 11.5%

Value | Count | Frequency (%)
50000 | 1383 | 11.5%
40000 | 999 | 8.3%
30000 | 4926 | 40.9%
20000 | 2375 | 19.7%
10000 | 2351 | 19.5%

matched_transaction_id
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 1155
Unique (%): 9.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 27773.90818
Minimum: 10112
Maximum: 50408
Zeros: 0
Zeros (%): 0.0%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 10112
5-th percentile: 10294
Q1: 20259
median: 30524
Q3: 31318
95-th percentile: 50224.35
Maximum: 50408
Range: 40296
Interquartile range (IQR): 11059

Descriptive statistics

Standard deviation: 12014.61197
Coefficient of variation (CV): 0.4325862926
Kurtosis: -0.6242298912
Mean: 27773.90818
Median Absolute Deviation (MAD): 9711.056015
Skewness: 0.2123450504
Sum: 334231211
Variance: 144350900.8
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[10112. 10115.5 10123.5 10125.5 10136. ... 50345. 50358.5 50373. 50407.5 50408. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
31463 | 25 | 0.2%
31462 | 25 | 0.2%
31460 | 25 | 0.2%
30402 | 23 | 0.2%
30395 | 23 | 0.2%
30398 | 23 | 0.2%
31121 | 22 | 0.2%
10331 | 22 | 0.2%
30393 | 22 | 0.2%
20265 | 22 | 0.2%
Other values (1145) | 11802 | 98.1%

Value | Count | Frequency (%)
10112 | 6 | < 0.1%
10113 | 6 | < 0.1%
10114 | 20 | 0.2%
10115 | 20 | 0.2%
10116 | 4 | < 0.1%

Value | Count | Frequency (%)
50408 | 7 | 0.1%
50407 | 7 | 0.1%
50379 | 2 | < 0.1%
50378 | 2 | < 0.1%
50376 | 2 | < 0.1%

feature_transaction_id
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 2132
Unique (%): 17.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 27547.60894
Minimum: 10000
Maximum: 50413
Zeros: 0
Zeros (%): 0.0%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 10000
5-th percentile: 10019
Q1: 20010
median: 30104
Q3: 31109
95-th percentile: 50144
Maximum: 50413
Range: 40413
Interquartile range (IQR): 11099

Descriptive statistics

Standard deviation: 12037.73407
Coefficient of variation (CV): 0.436979271
Kurtosis: -0.6063972023
Mean: 27547.60894
Median Absolute Deviation (MAD): 9690.125122
Skewness: 0.2337546208
Sum: 331507926
Variance: 144907041.6
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[10000. 10000.5 10010.5 10017.5 10028.5 ... 50327. 50340.5 50369.5 50407.5 50413. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
30003 | 117 | 1.0%
30002 | 117 | 1.0%
30077 | 116 | 1.0%
30078 | 115 | 1.0%
30158 | 115 | 1.0%
30159 | 114 | 0.9%
30071 | 114 | 0.9%
30083 | 114 | 0.9%
30012 | 114 | 0.9%
30096 | 112 | 0.9%
Other values (2122) | 10886 | 90.5%

Value | Count | Frequency (%)
10000 | 53 | 0.4%
10001 | 52 | 0.4%
10003 | 52 | 0.4%
10004 | 52 | 0.4%
10005 | 52 | 0.4%

Value | Count | Frequency (%)
50413 | 2 | < 0.1%
50412 | 2 | < 0.1%
50411 | 2 | < 0.1%
50410 | 2 | < 0.1%
50409 | 2 | < 0.1%

DateMappingMatch
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS
Distinct count: 11
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.2179013628
Minimum: 0
Maximum: 1
Zeros: 9068
Zeros (%): 75.4%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0.95
Maximum: 1
Range: 1
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.3845348501
Coefficient of variation (CV): 1.764719803
Kurtosis: -0.429122464
Mean: 0.2179013628
Median Absolute Deviation (MAD): 0.3283911514
Skewness: 1.23023108
Sum: 2622.225
Variance: 0.1478670509
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5375 0.6 0.6875 0.7375 ... 0.8375 0.875 0.925 0.975 1. ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
0 | 9068 | 75.4%
0.95 | 1636 | 13.6%
0.85 | 571 | 4.7%
0.9 | 217 | 1.8%
0.65 | 194 | 1.6%
0.825 | 179 | 1.5%
0.55 | 98 | 0.8%
1 | 36 | 0.3%
0.75 | 21 | 0.2%
0.525 | 11 | 0.1%

Value | Count | Frequency (%)
0 | 9068 | 75.4%
0.525 | 11 | 0.1%
0.55 | 98 | 0.8%
0.65 | 194 | 1.6%
0.725 | 3 | < 0.1%

Value | Count | Frequency (%)
1 | 36 | 0.3%
0.95 | 1636 | 13.6%
0.9 | 217 | 1.8%
0.85 | 571 | 4.7%
0.825 | 179 | 1.5%

AmountMappingMatch
Real number (ℝ≥0)

ZEROS
Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.03166029583
Minimum: 0
Maximum: 0.9
Zeros: 11225
Zeros (%): 93.3%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0.4
Maximum: 0.9
Range: 0.9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.1226109442
Coefficient of variation (CV): 3.872703681
Kurtosis: 15.71977645
Mean: 0.03166029583
Median Absolute Deviation (MAD): 0.05906378938
Skewness: 3.991207947
Sum: 381
Variance: 0.01503344364
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 0.65 0.8 0.9 ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
0 | 11225 | 93.3%
0.4 | 615 | 5.1%
0.7 | 159 | 1.3%
0.6 | 26 | 0.2%
0.9 | 9 | 0.1%

Value | Count | Frequency (%)
0 | 11225 | 93.3%
0.4 | 615 | 5.1%
0.6 | 26 | 0.2%
0.7 | 159 | 1.3%
0.9 | 9 | 0.1%

Value | Count | Frequency (%)
0.9 | 9 | 0.1%
0.7 | 159 | 1.3%
0.6 | 26 | 0.2%
0.4 | 615 | 5.1%
0 | 11225 | 93.3%

DescriptionMatch
Real number (ℝ≥0)

ZEROS
Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.02152235333
Minimum: 0
Maximum: 0.8
Zeros: 11581
Zeros (%): 96.2%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 0.8
Range: 0.8
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.1169950184
Coefficient of variation (CV): 5.435977032
Kurtosis: 32.76678298
Mean: 0.02152235333
Median Absolute Deviation (MAD): 0.04142435997
Skewness: 5.742049975
Sum: 259
Variance: 0.01368783433
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.3 0.5 0.7 0.8], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
0 | 11581 | 96.2%
0.8 | 193 | 1.6%
0.4 | 143 | 1.2%
0.6 | 60 | 0.5%
0.2 | 57 | 0.5%

Value | Count | Frequency (%)
0 | 11581 | 96.2%
0.2 | 57 | 0.5%
0.4 | 143 | 1.2%
0.6 | 60 | 0.5%
0.8 | 193 | 1.6%

Value | Count | Frequency (%)
0.8 | 193 | 1.6%
0.6 | 60 | 0.5%
0.4 | 143 | 1.2%
0.2 | 57 | 0.5%
0 | 11581 | 96.2%

DifferentPredictedTime
Boolean

HIGH CORRELATION
Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
1 | 11871 | 98.6%
0 | 163 | 1.4%

TimeMappingMatch
Boolean

HIGH CORRELATION
Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
0 | 11867 | 98.6%
1 | 167 | 1.4%

PredictedNameMatch
Real number (ℝ≥0)

ZEROS
Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.02421472495
Minimum: 0
Maximum: 0.8
Zeros: 11589
Zeros (%): 96.3%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 0.8
Range: 0.8
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.1286460537
Coefficient of variation (CV): 5.312720007
Kurtosis: 27.95876714
Mean: 0.02421472495
Median Absolute Deviation (MAD): 0.04663859854
Skewness: 5.38729344
Sum: 291.4
Variance: 0.01654980713
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.1 0.3 0.7 0.8], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
0 | 11589 | 96.3%
0.8 | 251 | 2.1%
0.4 | 91 | 0.8%
0.6 | 84 | 0.7%
0.2 | 19 | 0.2%

Value | Count | Frequency (%)
0 | 11589 | 96.3%
0.2 | 19 | 0.2%
0.4 | 91 | 0.8%
0.6 | 84 | 0.7%
0.8 | 251 | 2.1%

Value | Count | Frequency (%)
0.8 | 251 | 2.1%
0.6 | 84 | 0.7%
0.4 | 91 | 0.8%
0.2 | 19 | 0.2%
0 | 11589 | 96.3%

ShortNameMatch
Boolean

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
0 | 11578 | 96.2%
1 | 456 | 3.8%

DifferentPredictedDate
Boolean

HIGH CORRELATION
Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
1 | 9068 | 75.4%
0 | 2966 | 24.6%

PredictedAmountMatch
Real number (ℝ≥0)

SKEWED
ZEROS
Distinct count: 6
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.001005484461
Minimum: 0
Maximum: 0.6
Zeros: 11989
Zeros (%): 99.6%
Memory size: 94.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 0.6
Range: 0.6
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.02013383633
Coefficient of variation (CV): 20.0240154
Kurtosis: 579.2907206
Mean: 0.001005484461
Median Absolute Deviation (MAD): 0.002003449094
Skewness: 23.44637336
Sum: 12.1
Variance: 0.0004053713652
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.05 0.15 0.3 0.6 ], "bayesian blocks" binning strategy used)
Value | Count | Frequency (%)
0 | 11989 | 99.6%
0.1 | 24 | 0.2%
0.5 | 9 | 0.1%
0.4 | 8 | 0.1%
0.6 | 3 | < 0.1%
0.2 | 1 | < 0.1%

Value | Count | Frequency (%)
0 | 11989 | 99.6%
0.1 | 24 | 0.2%
0.2 | 1 | < 0.1%
0.4 | 8 | 0.1%
0.5 | 9 | 0.1%

Value | Count | Frequency (%)
0.6 | 3 | < 0.1%
0.5 | 9 | 0.1%
0.4 | 8 | 0.1%
0.2 | 1 | < 0.1%
0.1 | 24 | 0.2%
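
Because PredictedAmountMatch takes only six distinct values, its skewness can be recomputed from the value counts listed for this variable. The sketch below assumes the reported figure is the bias-adjusted Fisher-Pearson coefficient (the convention pandas is understood to use); that choice is an assumption, not something the report states.

```python
import math

# Value counts copied from the PredictedAmountMatch frequency table.
value_counts = {0.0: 11989, 0.1: 24, 0.5: 9, 0.4: 8, 0.6: 3, 0.2: 1}

n = sum(value_counts.values())
mean = sum(value * count for value, count in value_counts.items()) / n

# Second and third central moments (population form, divided by n).
m2 = sum(count * (value - mean) ** 2 for value, count in value_counts.items()) / n
m3 = sum(count * (value - mean) ** 3 for value, count in value_counts.items()) / n

# Plain moment coefficient of skewness.
g1 = m3 / m2 ** 1.5

# Assumed convention: bias-adjusted sample skewness.
adjusted_skewness = g1 * math.sqrt(n * (n - 1)) / (n - 2)

print(f"n={n}, mean={mean:.12f}, skewness={adjusted_skewness:.4f}")
```

With 99.6% of the mass at zero and a handful of values up to 0.6, the long right tail dominates the third moment, which is why this variable triggers the skewness warning.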
 

PredictedTimeCloseMatch
Boolean

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
0 | 11113 | 92.3%
1 | 921 | 7.7%

flag_match
Boolean

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 94.1 KiB

Value | Count | Frequency (%)
0 | 11177 | 92.9%
1 | 857 | 7.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that, for a linear relationship, the slope of the line (its angle to the x-axis) does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
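
The calculation described above can be sketched in plain Python. This is an illustrative implementation of the textbook formula, not the code the profiler runs; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance of X and Y over the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variances cancel, so raw sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 4, 7]
ys = [3, 1, 5, 6]
r = pearson_r(xs, ys)
# Shifting and positively rescaling either variable leaves r unchanged.
r_rescaled = pearson_r([10 * x + 3 for x in xs], [0.5 * y - 2 for y in ys])
print(r, r_rescaled)
```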

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
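
As a rough illustration of this definition, the sketch below ranks both variables and then applies the covariance-over-standard-deviations formula to the ranks. Tied values receive the average of the ranks they span, which is a common convention assumed here; the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# A cubic relationship is nonlinear but perfectly monotonic.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
print(rho)
```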

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
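
The pair-counting rule above corresponds to the variant usually called tau-a. The sketch below implements exactly that rule; note that many statistical libraries report a tie-adjusted variant (tau-b) instead, so results can differ when the data contain ties. The function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
        # Pairs tied in x or y count as neither concordant nor discordant.
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Six pairs in total: five concordant, one discordant (the swapped 3 and 2).
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

The quadratic pair loop is fine for small illustrations; for a table of 12034 rows an O(n log n) implementation would be preferable.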

Missing values

Sample

First rows

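index | receipt_id | company_id | matched_transaction_id | feature_transaction_id | DateMappingMatch | AmountMappingMatch | DescriptionMatch | DifferentPredictedTime | TimeMappingMatch | PredictedNameMatch | ShortNameMatch | DifferentPredictedDate | PredictedAmountMatch | PredictedTimeCloseMatch | flag_match
0 | 10000 | 10000 | 10468 | 10000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
1 | 10000 | 10000 | 10468 | 10001 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
2 | 10000 | 10000 | 10468 | 10003 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
3 | 10000 | 10000 | 10468 | 10004 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
4 | 10000 | 10000 | 10468 | 10005 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
5 | 10000 | 10000 | 10468 | 10006 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
6 | 10000 | 10000 | 10468 | 10008 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
7 | 10000 | 10000 | 10468 | 10009 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
8 | 10000 | 10000 | 10468 | 10010 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
9 | 10000 | 10000 | 10468 | 10011 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0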

Last rows

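index | receipt_id | company_id | matched_transaction_id | feature_transaction_id | DateMappingMatch | AmountMappingMatch | DescriptionMatch | DifferentPredictedTime | TimeMappingMatch | PredictedNameMatch | ShortNameMatch | DifferentPredictedDate | PredictedAmountMatch | PredictedTimeCloseMatch | flag_match
12024 | 50225 | 50000 | 50037 | 50022 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12025 | 50225 | 50000 | 50037 | 50024 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12026 | 50225 | 50000 | 50037 | 50026 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12027 | 50226 | 50000 | 50368 | 50070 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12028 | 50226 | 50000 | 50368 | 50072 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12029 | 50226 | 50000 | 50368 | 50074 | 0.65 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
12030 | 50226 | 50000 | 50368 | 50075 | 0.65 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0
12031 | 50226 | 50000 | 50368 | 50366 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0
12032 | 50226 | 50000 | 50368 | 50367 | 0.00 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0
12033 | 50226 | 50000 | 50368 | 50368 | 0.95 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1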